Forest fires are among the most common ecological disasters in the world. In the US, the ongoing record-breaking forest fires in California have caused great concern. Although forest fires play an important role in the natural cycle, they pose devastating threats to local communities and biodiversity. Their negative effects include pollution, deforestation, and health hazards, as well as the economic losses incurred by fire suppression. To better predict forest fires, fire modeling has been used in an attempt to characterize fire behavior and improve the safety of firefighters. We would therefore like to identify significant variables in forest fires and develop a simple model to help understand their patterns.
The first question we would like to answer is: what are the influential factors in predicting forest fires? Initially, we planned to explore the data set through visualization and then use regression analysis to identify significant variables. After learning about Shiny, we decided to incorporate a Shiny app into the visualization section to better illustrate the time-dependent distribution. As the course progressed, we were exposed to various machine learning methods and formed a more concrete idea of using modeling to predict forest fires.
The forest fire data we use come from Montesinho Natural Park in the northeast region of Portugal, collected from January 2000 to December 2003, with a total of 517 entries (http://archive.ics.uci.edu/ml/datasets/Forest+Fires). Features include spatial information on a 9x9 grid map of the park, temporal information, components of the Fire Weather Index (FWI) system, and meteorological information.
For temporal information, month and day of the week were selected to account for both natural factors and human causes; four FWI components—FFMC, DMC, DC, and ISI—were included; meteorological information comprises temperature, relative humidity, wind speed, and outside rain. The burned area of the forest is the outcome.
# Add season category
fire <- read.csv("forestfires.csv", header = TRUE, sep = ",")
fire$season <- rep("spring", 517)
for (i in 1:517) {
  if (fire$month[i] %in% c("feb","jan","dec")) fire$season[i] <- "winter"
  if (fire$month[i] %in% c("oct","nov","sep")) fire$season[i] <- "autumn"
  if (fire$month[i] %in% c("aug","jul","jun")) fire$season[i] <- "summer"
}
fire$season <- as.factor(fire$season)

# Numeric season code: spring = 0, summer = 1, autumn = 2, winter = 3
fire$season.cat <- rep(0, 517)
for (i in 1:517) {
  if (fire$season[i] == "summer") fire$season.cat[i] <- 1
  if (fire$season[i] == "autumn") fire$season.cat[i] <- 2
  if (fire$season[i] == "winter") fire$season.cat[i] <- 3
}
head(fire)
## X Y month day FFMC DMC DC ISI temp RH wind rain area season season.cat
## 1 7 5 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 0 spring 0
## 2 7 4 oct tue 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0 0 autumn 2
## 3 7 4 oct sat 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0 0 autumn 2
## 4 8 6 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.2 0 spring 0
## 5 8 6 mar sun 89.3 51.3 102.2 9.6 11.4 99 1.8 0.0 0 spring 0
## 6 8 6 aug sun 92.3 85.3 488.0 14.7 22.2 29 5.4 0.0 0 summer 1
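The loops above work, but the same mapping can be expressed more compactly with vectorized assignment. A minimal sketch on a toy month vector (the level names and numeric coding match the ones used above; the toy data are illustrative, not from the dataset):

```r
# Toy month vector standing in for fire$month
month <- c("mar", "oct", "aug", "jan", "may")

# Vectorized season lookup: default "spring", overridden per month group
season <- rep("spring", length(month))
season[month %in% c("dec", "jan", "feb")] <- "winter"
season[month %in% c("sep", "oct", "nov")] <- "autumn"
season[month %in% c("jun", "jul", "aug")] <- "summer"

# Numeric category with the same coding as season.cat above
season.cat <- c(spring = 0, summer = 1, autumn = 2, winter = 3)[season]
```

Replacing `month` with `fire$month` reproduces the two loops in a few vectorized lines.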
Our dataset was ready for analysis, so no additional scraping was needed. For the cleanup, we therefore focused on identifying any outliers or potential influential points that might affect our regression model fitting.
One regression model we planned to fit is a linear regression model, which assumes normality, but our outcome, the burned area, is not normally distributed (see the histogram below on the left). We therefore applied a log transformation to the burned areas above 0 to approximate a normal distribution. Next, we identified outliers and potential influential points in our dataset based on leverage and Cook's distance.
# Area log transformation (for area>0)
hist(fire$area,40, main = "Histogram of area", xlab = "Area")
fire["logarea"] <- ifelse(fire$area >0, log(fire$area), NA)
ggplot(data=fire, aes(x=logarea))+
geom_histogram(aes(y=..density..), col="black",fill="white")+
stat_function(fun=dnorm, args = list(mean=mean(fire$logarea, na.rm = TRUE), sd = sd(fire$logarea, na.rm=TRUE)),col="red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 247 rows containing non-finite values (stat_bin).
We can see that after the log transformation, log(area) is approximately normally distributed.
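Beyond eyeballing the histogram, normality of the transformed outcome can be checked formally, e.g. with a Shapiro–Wilk test. A sketch on simulated log-normal areas (the real analysis would use `fire$logarea`; the simulation parameters are illustrative):

```r
set.seed(42)
# Simulated burned areas with a log-normal right skew,
# standing in for the positive entries of fire$area
area <- rlnorm(270, meanlog = 1.8, sdlog = 1.4)

# Raw areas should fail the normality test; log-areas should fare far better
p_raw <- shapiro.test(area)$p.value
p_log <- shapiro.test(log(area))$p.value
```

A tiny `p_raw` indicates strong evidence against normality on the raw scale, while `p_log` is typically much larger after the transformation.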
# Outliers / Cook's Distance
area_posit <- fire[which(fire$area>0),]
mod_lin <- lm(logarea ~X+Y+month+day+FFMC+DMC+DC+ISI+temp+RH+wind+rain, data=area_posit[c(seq(1,12),15,16)])
ols_plot_cooksd_bar(mod_lin) # the 262nd observation in this subset, which is row id 500
ols_plot_resid_lev(mod_lin)
ols_plot_dffits(mod_lin)
cooksd <- cooks.distance(mod_lin)
plot(cooksd, pch="*", cex=2, main="Influential Obs by Cooks distance") # plot cook's distance
abline(h = 4*mean(cooksd, na.rm=T), col="red") # add cutoff line
text(x=1:length(cooksd)+1, y=cooksd+0.001, labels=ifelse(cooksd>4*mean(cooksd, na.rm=T),names(cooksd),""), col="red")
# Identify and remove the influential points
influential = which(cooksd>4*mean(cooksd, na.rm=T)) # influential points
influ <- area_posit[influential, ] #all data for influential points
influ
## X Y month day FFMC DMC DC ISI temp RH wind rain area season
## 139 9 9 jul tue 85.8 48.3 313.4 3.9 18.0 42 2.7 0.0 0.36 summer
## 144 1 2 jul sat 90.0 51.3 296.3 8.7 16.6 53 5.4 0.0 0.71 summer
## 148 8 3 sep tue 84.4 73.4 671.9 3.2 24.2 28 3.6 0.0 0.96 autumn
## 200 2 4 sep mon 63.5 70.8 665.3 0.8 22.6 38 3.6 0.0 11.32 autumn
## 239 6 5 sep sat 92.5 121.1 674.4 8.6 25.1 27 4.0 0.0 1090.84 autumn
## 267 6 5 aug tue 94.3 131.7 607.1 22.7 19.4 55 4.0 0.0 0.17 summer
## 416 8 6 aug thu 94.8 222.4 698.6 13.9 27.5 27 4.9 0.0 746.28 summer
## 421 8 8 aug wed 91.7 191.4 635.9 7.8 26.2 36 4.5 0.0 185.76 summer
## 470 6 3 apr sun 91.0 14.6 25.6 12.3 13.7 33 9.4 0.0 61.13 spring
## 474 9 4 jun sat 90.5 61.1 252.6 9.4 24.5 50 3.1 0.0 70.32 summer
## 480 7 4 jul mon 89.2 103.9 431.6 6.4 22.6 57 4.9 0.0 278.53 summer
## 500 7 5 aug tue 96.1 181.1 671.2 14.3 27.3 63 4.9 6.4 10.82 summer
## 514 2 4 aug sun 81.6 56.7 665.6 1.9 21.9 71 5.8 0.0 54.29 summer
## season.cat logarea
## 139 1 -1.02165125
## 144 1 -0.34249031
## 148 2 -0.04082199
## 200 2 2.42657107
## 239 2 6.99470332
## 267 1 -1.77195684
## 416 1 6.61510086
## 421 1 5.22445552
## 470 0 4.11300274
## 474 1 4.25305625
## 480 1 5.62952577
## 500 1 2.38139627
## 514 1 3.99434005
dat <- area_posit[-influential,] # remove the influential points
Looking at the potential influential points: id 500 has the maximum FFMC and a DMC above the third quartile; id 470 has a very small DC; id 267 has a high ISI; id 239 has the maximum burned area given its parameters; and so on.
We created a new data set excluding these points to see whether fitting our regression models without the influential points improved the model fit. There are no missing values in our dataset.
# New Dataset
new_dat <- rbind(fire[which(fire$area==0),], dat) # join the area > 0 rows (w/o influential points) with the area = 0 rows
new_dat$burn <- ifelse(new_dat$area==0,0,1) # binary indicator: burn = 1 if area > 0
head(new_dat)
## X Y month day FFMC DMC DC ISI temp RH wind rain area season season.cat
## 1 7 5 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 0 spring 0
## 2 7 4 oct tue 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0 0 autumn 2
## 3 7 4 oct sat 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0 0 autumn 2
## 4 8 6 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.2 0 spring 0
## 5 8 6 mar sun 89.3 51.3 102.2 9.6 11.4 99 1.8 0.0 0 spring 0
## 6 8 6 aug sun 92.3 85.3 488.0 14.7 22.2 29 5.4 0.0 0 summer 1
## logarea burn
## 1 NA 0
## 2 NA 0
## 3 NA 0
## 4 NA 0
## 5 NA 0
## 6 NA 0
md.pattern(new_dat[,c(1:15,17)]) #no missing value
## /\ /\
## { `---' }
## { O O }
## ==> V <== No need for mice. This data set is completely observed.
## \ \|/ /
## `-----'
## X Y month day FFMC DMC DC ISI temp RH wind rain area season season.cat burn
## 504 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## 504 0
## 0
summary(new_dat)
## X Y month day
## Min. :1.000 Min. :2.000 Length:504 Length:504
## 1st Qu.:3.000 1st Qu.:4.000 Class :character Class :character
## Median :4.000 Median :4.000 Mode :character Mode :character
## Mean :4.633 Mean :4.288
## 3rd Qu.:7.000 3rd Qu.:5.000
## Max. :9.000 Max. :9.000
##
## FFMC DMC DC ISI
## Min. :18.70 Min. : 1.10 Min. : 7.9 Min. : 0.000
## 1st Qu.:90.28 1st Qu.: 70.53 1st Qu.:441.8 1st Qu.: 6.500
## Median :91.60 Median :108.30 Median :664.2 Median : 8.400
## Mean :90.71 Mean :111.10 Mean :549.0 Mean : 9.028
## 3rd Qu.:92.90 3rd Qu.:142.40 3rd Qu.:714.3 3rd Qu.:10.725
## Max. :96.20 Max. :291.30 Max. :860.6 Max. :56.100
##
## temp RH wind rain
## Min. : 2.20 Min. : 15.00 Min. :0.400 Min. :0.000000
## 1st Qu.:15.40 1st Qu.: 32.75 1st Qu.:2.700 1st Qu.:0.000000
## Median :19.20 Median : 41.50 Median :4.000 Median :0.000000
## Mean :18.80 Mean : 44.28 Mean :4.001 Mean :0.009524
## 3rd Qu.:22.73 3rd Qu.: 53.00 3rd Qu.:4.900 3rd Qu.:0.000000
## Max. :33.30 Max. :100.00 Max. :9.400 Max. :1.400000
##
## area season season.cat logarea
## Min. : 0.000 autumn:185 Min. :0.000 Min. :-2.4080
## 1st Qu.: 0.000 spring: 64 1st Qu.:1.000 1st Qu.: 0.7608
## Median : 0.420 summer:224 Median :1.000 Median : 1.8083
## Mean : 8.196 winter: 31 Mean :1.363 Mean : 1.7885
## 3rd Qu.: 6.365 3rd Qu.:2.000 3rd Qu.: 2.6596
## Max. :212.880 Max. :3.000 Max. : 5.3607
## NA's :247
## burn
## Min. :0.0000
## 1st Qu.:0.0000
## Median :1.0000
## Mean :0.5099
## 3rd Qu.:1.0000
## Max. :1.0000
##
We used an R Shiny app to visualize our data, getting a rough idea of the frequency, size, and distribution of forest fires. We were interested in the weather observations associated with the FWI index: temperature, relative humidity, wind speed, and rain. For each of these variables, a plot describes its relationship with forest fires by month. For a more direct interpretation, we also plotted the count and area of forest fires by the variables of interest. The Shiny app is in the GitHub repository and is also shown on the Google site; the following pictures are screenshots of the app.
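The count-by-month view in the app can be reproduced outside Shiny with base R. A sketch on a toy month vector (in the real app the months come from `fire$month`):

```r
# Toy months standing in for fire$month
month <- factor(c("mar", "oct", "aug", "aug", "sep", "sep", "sep", "mar"),
                levels = c("jan", "feb", "mar", "apr", "may", "jun",
                           "jul", "aug", "sep", "oct", "nov", "dec"))

# Fire counts by month, as in the app's count view
counts <- table(month)
barplot(counts, las = 2, main = "Forest fire count by month",
        xlab = "Month", ylab = "Count")
```

The factor levels keep the months in calendar order rather than alphabetical order on the x-axis.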
For statistical methods, we used a linear regression model and a logistic regression model based on our data. Since the outcome, burned area, is continuous (not a count), we considered a linear regression model; we then categorized the outcome as binary (burn = 1 if burned area > 0, burn = 0 if burned area = 0) and fitted a logistic regression model. To compare these regression models, we mainly looked at \(R^2\), adjusted \(R^2\), and the square root of the MSE (mean squared error) for the linear regression models; the AUC and a goodness-of-fit (GOF) test for the logistic regression models; and AIC and BIC for all.
After learning about machine learning, we selected the important variables from the logistic regression models and used them to predict whether a forest fire will occur. We mainly focused on accuracy, specificity, sensitivity, and AUC to compare the machine learning models.
Below is the code for the linear regression, logistic regression, and machine learning models.
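These classification metrics can be computed directly from predicted classes and true labels. A minimal sketch with made-up predictions (the vectors are illustrative, not output from our models):

```r
# Illustrative true labels and predicted classes for a binary burn outcome
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0)

tp <- sum(predicted == 1 & actual == 1)  # true positives
tn <- sum(predicted == 0 & actual == 0)  # true negatives
fp <- sum(predicted == 1 & actual == 0)  # false positives
fn <- sum(predicted == 0 & actual == 1)  # false negatives

accuracy    <- (tp + tn) / length(actual)
sensitivity <- tp / (tp + fn)  # true positive rate
specificity <- tn / (tn + fp)  # true negative rate
```

In practice packages such as caret wrap these calculations, but the definitions are just the confusion-matrix ratios above.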
# Fit linear regression to rows with area > 0, where log(area) is approximately normal
area_posit <- fire[which(fire$area>0),]
summary(area_posit)
## X Y month day
## Min. :1.000 Min. :2.000 Length:270 Length:270
## 1st Qu.:3.000 1st Qu.:4.000 Class :character Class :character
## Median :5.000 Median :4.000 Mode :character Mode :character
## Mean :4.807 Mean :4.367
## 3rd Qu.:7.000 3rd Qu.:5.000
## Max. :9.000 Max. :9.000
## FFMC DMC DC ISI
## Min. :63.50 Min. : 3.2 Min. : 15.3 Min. : 0.800
## 1st Qu.:90.33 1st Qu.: 82.9 1st Qu.:486.5 1st Qu.: 6.800
## Median :91.70 Median :111.7 Median :665.6 Median : 8.400
## Mean :91.03 Mean :114.7 Mean :570.9 Mean : 9.177
## 3rd Qu.:92.97 3rd Qu.:141.3 3rd Qu.:721.3 3rd Qu.:11.375
## Max. :96.20 Max. :291.3 Max. :860.6 Max. :22.700
## temp RH wind rain
## Min. : 2.20 Min. :15.00 Min. :0.400 Min. :0.00000
## 1st Qu.:16.12 1st Qu.:33.00 1st Qu.:2.700 1st Qu.:0.00000
## Median :20.10 Median :41.00 Median :4.000 Median :0.00000
## Mean :19.31 Mean :43.73 Mean :4.113 Mean :0.02889
## 3rd Qu.:23.40 3rd Qu.:53.00 3rd Qu.:4.900 3rd Qu.:0.00000
## Max. :33.30 Max. :96.00 Max. :9.400 Max. :6.40000
## area season season.cat logarea
## Min. : 0.09 autumn:102 Min. :0.00 Min. :-2.4079
## 1st Qu.: 2.14 spring: 24 1st Qu.:1.00 1st Qu.: 0.7608
## Median : 6.37 summer:125 Median :1.00 Median : 1.8516
## Mean : 24.60 winter: 19 Mean :1.43 Mean : 1.8448
## 3rd Qu.: 15.42 3rd Qu.:2.00 3rd Qu.: 2.7358
## Max. :1090.84 Max. :3.00 Max. : 6.9947
head(area_posit)
## X Y month day FFMC DMC DC ISI temp RH wind rain area season
## 139 9 9 jul tue 85.8 48.3 313.4 3.9 18.0 42 2.7 0 0.36 summer
## 140 1 4 sep tue 91.0 129.5 692.6 7.0 21.7 38 2.2 0 0.43 autumn
## 141 2 5 sep mon 90.9 126.5 686.5 7.0 21.9 39 1.8 0 0.47 autumn
## 142 1 2 aug wed 95.5 99.9 513.3 13.2 23.3 31 4.5 0 0.55 summer
## 143 8 6 aug fri 90.1 108.0 529.8 12.5 21.2 51 8.9 0 0.61 summer
## 144 1 2 jul sat 90.0 51.3 296.3 8.7 16.6 53 5.4 0 0.71 summer
## season.cat logarea
## 139 1 -1.0216512
## 140 2 -0.8439701
## 141 2 -0.7550226
## 142 1 -0.5978370
## 143 1 -0.4942963
## 144 1 -0.3424903
# Scatterplot with lowess curve
ggplot(data=area_posit, aes(x=DMC, y=logarea))+
geom_point()+
geom_smooth(method = "loess",se=FALSE)+
theme_bw()
## `geom_smooth()` using formula 'y ~ x'
# Fit linear regression model with full dataset
mod_lin <- lm(logarea ~X+Y+month+day+FFMC+DMC+DC+ISI+temp+RH+wind+rain, data=area_posit[c(seq(1,12),15,16)])
summary(mod_lin) # DMC and DC are statistically significant
##
## Call:
## lm(formula = logarea ~ X + Y + month + day + FFMC + DMC + DC +
## ISI + temp + RH + wind + rain, data = area_posit[c(seq(1,
## 12), 15, 16)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2131 -0.9790 0.0380 0.8145 4.4916
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.2117947 3.9629695 0.306 0.760033
## X 0.0515407 0.0473061 1.090 0.277001
## Y -0.0944394 0.0960197 -0.984 0.326315
## monthaug 0.7546454 1.3660544 0.552 0.581161
## monthdec 1.9052997 1.1644253 1.636 0.103075
## monthfeb -0.2741962 0.9130864 -0.300 0.764207
## monthjul 0.0373724 1.1677970 0.032 0.974496
## monthjun -0.4647905 1.0983596 -0.423 0.672545
## monthmar -0.4325527 0.8534419 -0.507 0.612730
## monthmay 1.2928830 1.7376890 0.744 0.457578
## monthoct 3.2689080 1.6666212 1.961 0.050970 .
## monthsep 2.0552667 1.5337220 1.340 0.181475
## daymon -0.0005960 0.3603204 -0.002 0.998682
## daysat 0.6039482 0.3413894 1.769 0.078128 .
## daysun 0.4058246 0.3367923 1.205 0.229382
## daythu 0.2295388 0.3752345 0.612 0.541292
## daytue 0.3295887 0.3519106 0.937 0.349906
## daywed 0.0269697 0.3705798 0.073 0.942043
## FFMC 0.0071617 0.0433667 0.165 0.868969
## DMC 0.0094934 0.0028344 3.349 0.000938 ***
## DC -0.0051541 0.0020759 -2.483 0.013706 *
## ISI -0.0309518 0.0372786 -0.830 0.407190
## temp 0.0393276 0.0348704 1.128 0.260504
## RH -0.0004741 0.0100132 -0.047 0.962274
## wind 0.0517875 0.0584917 0.885 0.376822
## rain 0.0457141 0.2420749 0.189 0.850373
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.499 on 244 degrees of freedom
## Multiple R-squared: 0.1254, Adjusted R-squared: 0.0358
## F-statistic: 1.4 on 25 and 244 DF, p-value: 0.1036
tidy(mod_lin)
## # A tibble: 26 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.21 3.96 0.306 0.760
## 2 X 0.0515 0.0473 1.09 0.277
## 3 Y -0.0944 0.0960 -0.984 0.326
## 4 monthaug 0.755 1.37 0.552 0.581
## 5 monthdec 1.91 1.16 1.64 0.103
## 6 monthfeb -0.274 0.913 -0.300 0.764
## 7 monthjul 0.0374 1.17 0.0320 0.974
## 8 monthjun -0.465 1.10 -0.423 0.673
## 9 monthmar -0.433 0.853 -0.507 0.613
## 10 monthmay 1.29 1.74 0.744 0.458
## # … with 16 more rows
#Fit linear regression again, now without the influential points and outliers
mod_lin2 <- lm(logarea ~X+Y+month+day+FFMC+DMC+DC+ISI+temp+RH+wind+rain, data=dat)
summary(mod_lin2)
##
## Call:
## lm(formula = logarea ~ X + Y + month + day + FFMC + DMC + DC +
## ISI + temp + RH + wind + rain, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9746 -0.9080 -0.0007 0.7724 3.5148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.987342 4.994567 0.198 0.843467
## X 0.032411 0.044763 0.724 0.469771
## Y -0.101419 0.092957 -1.091 0.276398
## monthaug 1.313931 1.374839 0.956 0.340224
## monthdec 2.022285 1.150576 1.758 0.080135 .
## monthfeb -0.043227 0.925304 -0.047 0.962780
## monthjul 0.789941 1.219786 0.648 0.517883
## monthjun -0.028542 1.132933 -0.025 0.979922
## monthmar -0.119666 0.904959 -0.132 0.894914
## monthmay 1.731173 1.647245 1.051 0.294379
## monthoct 3.860256 1.625441 2.375 0.018372 *
## monthsep 2.582194 1.532596 1.685 0.093368 .
## daymon -0.115957 0.333272 -0.348 0.728206
## daysat 0.525835 0.317920 1.654 0.099489 .
## daysun 0.440884 0.312954 1.409 0.160244
## daythu 0.088689 0.347098 0.256 0.798553
## daytue 0.609855 0.329800 1.849 0.065711 .
## daywed -0.085819 0.340491 -0.252 0.801231
## FFMC 0.020687 0.056889 0.364 0.716462
## DMC 0.009157 0.002689 3.406 0.000778 ***
## DC -0.005234 0.001983 -2.640 0.008856 **
## ISI -0.011857 0.037883 -0.313 0.754582
## temp -0.011517 0.033153 -0.347 0.728616
## RH -0.008238 0.009472 -0.870 0.385330
## wind 0.002567 0.055097 0.047 0.962878
## rain 0.098441 1.030367 0.096 0.923969
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.362 on 231 degrees of freedom
## Multiple R-squared: 0.1515, Adjusted R-squared: 0.05972
## F-statistic: 1.65 on 25 and 231 DF, p-value: 0.03051
plot(mod_lin2, which=c(1,2,3))
## Warning: not plotting observations with leverage one:
## 232, 254
From the coefficients and p-values of this linear model fitted on the full dataset, we can see that DMC (p-value < 0.001) and DC (p-value < 0.05) are statistically significant.
From the coefficients and p-values of the linear model fitted on the dataset without the potential influential points, DMC and DC remain statistically significant, and 'monthoct' becomes significant as well (p-value < 0.05). This model also has more variables with p-values close to 0.05.
We also checked the LINE (Linearity, Independence, Normality, Equal variance) assumptions for the second linear regression model. In its residuals vs. fitted plot, the points spread evenly across the zero line, so the linearity and equal variance assumptions hold; in its normal Q-Q plot, the points do not deviate much from the line, so the normality assumption holds.
table_12<- matrix(c(summary(mod_lin)$r.squared, summary(mod_lin)$adj.r.squared, sqrt(mean(mod_lin$residuals^2)),AIC(mod_lin), BIC(mod_lin),
summary(mod_lin2)$r.squared, summary(mod_lin2)$adj.r.squared, sqrt(mean(mod_lin2$residuals^2)),AIC(mod_lin2), BIC(mod_lin2)), ncol=5,nrow=2, byrow=TRUE)
colnames(table_12)<- c("R^2", "Adjusted R^2", "Square root of MSE","AIC", "BIC")
rownames(table_12)<- c("logarea ~.",
"logarea~., (w/o influential points)")
table_12 <- as.table(table_12)
kable(table_12)
| | R^2 | Adjusted R^2 | Square root of MSE | AIC | BIC |
|---|---|---|---|---|---|
| logarea ~ . | 0.1254112 | 0.0358017 | 1.425126 | 1011.5275 | 1108.685 |
| logarea ~ ., (w/o influential points) | 0.1515462 | 0.0597222 | 1.291624 | 914.8673 | 1010.692 |
We compared these two models (#1 & #2) by \(R^2\), adjusted \(R^2\), square root of MSE, AIC, and BIC. \(R^2\) is the proportion of the overall variability in the outcome that the model is able to explain; adjusted \(R^2\) is a modified version of \(R^2\) that takes into account the number of predictors in the model; MSE is the estimated variance of the observed outcome about the fitted regression line—in other words, it estimates the amount of variability in the outcome that the model is unable to explain; AIC and BIC are model selection criteria.
We prefer higher values of \(R^2\) and adjusted \(R^2\), and smaller values of the square root of MSE, AIC, and BIC. Here, the model fitted on the dataset without the influential points has higher \(R^2\) and adjusted \(R^2\) and smaller square root of MSE, AIC, and BIC, indicating that the second model fits the data better. Therefore, in further model fitting, we use the dataset without the influential points for better prediction.
Since we have many variables to consider, we also tried linear models with forward selection and backward elimination based on AIC.
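The comparison tables are built from standard extractor functions. A self-contained sketch of the same metrics on two nested toy models (using R's built-in mtcars data, purely for illustration):

```r
# Two nested toy models on mtcars, standing in for our fire models
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)

# The same metrics used in the comparison tables above
metrics <- function(m) c(
  r2     = summary(m)$r.squared,
  adj.r2 = summary(m)$adj.r.squared,
  rmse   = sqrt(mean(residuals(m)^2)),   # square root of MSE
  aic    = AIC(m),
  bic    = BIC(m)
)
round(rbind(m1 = metrics(m1), m2 = metrics(m2)), 3)
```

Here the richer model wins on every criterion: higher \(R^2\) and adjusted \(R^2\), lower RMSE, AIC, and BIC.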
# Linear model with forward selection & backward elimination based on AIC
full.model <- lm(logarea ~., data = dat[c(seq(1,12),15,16)])
step.forw <- step(lm(logarea~1, data=dat[c(seq(1,12),15,16)]), ~X+Y+month+day+FFMC+DMC+DC+ISI+temp+RH+wind+rain, direction = "forward")
## Start: AIC=175.77
## logarea ~ 1
##
## Df Sum of Sq RSS AIC
## + ISI 1 6.1318 499.20 174.63
## + temp 1 4.2722 501.06 175.59
## <none> 505.33 175.77
## + FFMC 1 1.8494 503.48 176.83
## + rain 1 1.0318 504.30 177.24
## + Y 1 0.7418 504.59 177.39
## + RH 1 0.6581 504.67 177.43
## + DC 1 0.6023 504.73 177.46
## + wind 1 0.3736 504.96 177.58
## + X 1 0.1436 505.19 177.69
## + DMC 1 0.0142 505.32 177.76
## + month 9 28.4319 476.90 178.89
## + day 6 16.8713 488.46 179.04
##
## Step: AIC=174.63
## logarea ~ ISI
##
## Df Sum of Sq RSS AIC
## <none> 499.20 174.63
## + RH 1 1.5252 497.68 175.84
## + rain 1 1.2029 498.00 176.01
## + Y 1 1.1403 498.06 176.04
## + DMC 1 0.9407 498.26 176.15
## + temp 1 0.9073 498.29 176.16
## + wind 1 0.6063 498.59 176.32
## + FFMC 1 0.4141 498.79 176.42
## + X 1 0.3244 498.88 176.46
## + DC 1 0.0082 499.19 176.63
## + day 6 16.7117 482.49 177.88
## + month 9 22.3065 476.89 180.88
step.back <- step(full.model, direction = "backward")
## Start: AIC=183.53
## logarea ~ X + Y + month + day + FFMC + DMC + DC + ISI + temp +
## RH + wind + rain + season.cat
##
##
## Step: AIC=183.53
## logarea ~ X + Y + month + day + FFMC + DMC + DC + ISI + temp +
## RH + wind + rain
##
## Df Sum of Sq RSS AIC
## - wind 1 0.004 428.76 181.53
## - rain 1 0.017 428.77 181.54
## - ISI 1 0.182 428.93 181.64
## - temp 1 0.224 428.98 181.67
## - FFMC 1 0.245 429.00 181.68
## - X 1 0.973 429.72 182.12
## - day 6 18.275 447.03 182.26
## - RH 1 1.404 430.16 182.37
## - Y 1 2.209 430.96 182.85
## <none> 428.75 183.53
## - month 9 37.254 466.01 186.95
## - DC 1 12.936 441.69 189.17
## - DMC 1 21.529 450.28 194.12
##
## Step: AIC=181.54
## logarea ~ X + Y + month + day + FFMC + DMC + DC + ISI + temp +
## RH + rain
##
## Df Sum of Sq RSS AIC
## - rain 1 0.020 428.78 179.55
## - ISI 1 0.184 428.94 179.65
## - temp 1 0.236 428.99 179.68
## - FFMC 1 0.242 429.00 179.68
## - X 1 0.979 429.73 180.12
## - day 6 18.335 447.09 180.30
## - RH 1 1.412 430.17 180.38
## - Y 1 2.223 430.98 180.86
## <none> 428.76 181.53
## - month 9 37.269 466.02 184.96
## - DC 1 12.932 441.69 187.17
## - DMC 1 21.779 450.53 192.27
##
## Step: AIC=179.55
## logarea ~ X + Y + month + day + FFMC + DMC + DC + ISI + temp +
## RH
##
## Df Sum of Sq RSS AIC
## - ISI 1 0.201 428.98 177.67
## - temp 1 0.223 429.00 177.68
## - FFMC 1 0.250 429.03 177.70
## - X 1 0.981 429.76 178.13
## - day 6 18.354 447.13 178.32
## - RH 1 1.395 430.17 178.38
## - Y 1 2.237 431.01 178.88
## <none> 428.78 179.55
## - month 9 37.476 466.25 183.08
## - DC 1 13.013 441.79 185.23
## - DMC 1 21.864 450.64 190.33
##
## Step: AIC=177.67
## logarea ~ X + Y + month + day + FFMC + DMC + DC + temp + RH
##
## Df Sum of Sq RSS AIC
## - FFMC 1 0.083 429.06 175.72
## - temp 1 0.267 429.24 175.83
## - X 1 0.919 429.90 176.22
## - day 6 18.312 447.29 176.41
## - RH 1 1.558 430.53 176.60
## - Y 1 2.120 431.10 176.94
## <none> 428.98 177.67
## - month 9 40.010 468.99 182.59
## - DC 1 12.934 441.91 183.30
## - DMC 1 22.914 451.89 189.04
##
## Step: AIC=175.72
## logarea ~ X + Y + month + day + DMC + DC + temp + RH
##
## Df Sum of Sq RSS AIC
## - temp 1 0.229 429.29 173.85
## - X 1 0.899 429.96 174.26
## - day 6 18.249 447.31 174.42
## - RH 1 1.723 430.78 174.75
## - Y 1 2.156 431.22 175.01
## <none> 429.06 175.72
## - month 9 39.933 468.99 180.59
## - DC 1 13.726 442.79 181.81
## - DMC 1 24.395 453.45 187.93
##
## Step: AIC=173.85
## logarea ~ X + Y + month + day + DMC + DC + RH
##
## Df Sum of Sq RSS AIC
## - day 6 18.022 447.31 172.42
## - X 1 1.008 430.30 172.46
## - RH 1 2.032 431.32 173.07
## - Y 1 2.361 431.65 173.26
## <none> 429.29 173.85
## - DC 1 14.038 443.33 180.12
## - DMC 1 24.169 453.46 185.93
## - month 9 55.884 485.17 187.31
##
## Step: AIC=172.42
## logarea ~ X + Y + month + DMC + DC + RH
##
## Df Sum of Sq RSS AIC
## - X 1 0.996 448.31 171.00
## - RH 1 1.333 448.64 171.19
## - Y 1 2.796 450.11 172.03
## <none> 447.31 172.42
## - DC 1 15.327 462.64 179.08
## - month 9 54.709 502.02 184.08
## - DMC 1 27.839 475.15 185.94
##
## Step: AIC=171
## logarea ~ Y + month + DMC + DC + RH
##
## Df Sum of Sq RSS AIC
## - RH 1 1.083 449.39 169.62
## - Y 1 1.834 450.14 170.04
## <none> 448.31 171.00
## - DC 1 14.482 462.79 177.17
## - month 9 53.721 502.03 182.08
## - DMC 1 26.874 475.18 183.96
##
## Step: AIC=169.62
## logarea ~ Y + month + DMC + DC
##
## Df Sum of Sq RSS AIC
## - Y 1 1.775 451.16 168.63
## <none> 449.39 169.62
## - DC 1 14.348 463.74 175.69
## - month 9 53.707 503.10 180.63
## - DMC 1 26.021 475.41 182.08
##
## Step: AIC=168.63
## logarea ~ month + DMC + DC
##
## Df Sum of Sq RSS AIC
## <none> 451.16 168.63
## - DC 1 13.848 465.01 174.40
## - month 9 52.829 503.99 179.09
## - DMC 1 25.059 476.22 180.52
summary(step.forw)
##
## Call:
## lm(formula = logarea ~ ISI, data = dat[c(seq(1, 12), 15, 16)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0204 -1.0208 -0.0905 0.8410 3.5457
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.14044 0.21716 9.857 <2e-16 ***
## ISI -0.03826 0.02162 -1.770 0.078 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.399 on 255 degrees of freedom
## Multiple R-squared: 0.01213, Adjusted R-squared: 0.00826
## F-statistic: 3.132 on 1 and 255 DF, p-value: 0.07795
summary(step.back)
##
## Call:
## lm(formula = logarea ~ month + DMC + DC, data = dat[c(seq(1,
## 12), 15, 16)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7872 -0.9334 -0.0507 0.8514 3.6869
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.840299 0.787455 2.337 0.020245 *
## monthaug 1.563758 1.231400 1.270 0.205324
## monthdec 2.225481 1.059302 2.101 0.036672 *
## monthfeb 0.247090 0.893731 0.276 0.782420
## monthjul 1.113326 1.081491 1.029 0.304290
## monthjun 0.079900 1.025060 0.078 0.937934
## monthmar 0.232653 0.844752 0.275 0.783233
## monthmay 1.958612 1.567324 1.250 0.212618
## monthoct 3.979868 1.522284 2.614 0.009492 **
## monthsep 2.833618 1.406153 2.015 0.044980 *
## DMC 0.009194 0.002492 3.689 0.000277 ***
## DC -0.005187 0.001892 -2.742 0.006553 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.357 on 245 degrees of freedom
## Multiple R-squared: 0.1072, Adjusted R-squared: 0.06711
## F-statistic: 2.674 on 11 and 245 DF, p-value: 0.00293
plot(step.back, 1:3)
## Warning: not plotting observations with leverage one:
## 232
In the forward selection model, only the ISI predictor was selected, and with a p-value that is not statistically significant.
In the backward elimination model, more predictors were selected, and 'monthdec', 'monthoct', 'monthsep', DMC, and DC are statistically significant with p-values < 0.05. We also checked the LINE assumptions for the backward elimination model: linearity holds since the residuals scatter randomly above and below zero in the residuals vs. fitted plot; normality holds because the residuals do not deviate from the line in the normal Q-Q plot; and equal variance holds because the residuals spread out evenly above and below the zero line in the residuals vs. fitted plot.
Finally, we compared these three fitted linear regression models on \(R^2\), adjusted \(R^2\), square root of MSE, AIC and BIC, to see which one performs the best.
table1<- matrix(c(summary(mod_lin)$r.squared, summary(mod_lin)$adj.r.squared, sqrt(mean(mod_lin$residuals^2)),AIC(mod_lin), BIC(mod_lin),
summary(mod_lin2)$r.squared, summary(mod_lin2)$adj.r.squared, sqrt(mean(mod_lin2$residuals^2)),AIC(mod_lin2), BIC(mod_lin2),
summary(step.forw)$r.squared, summary(step.forw)$adj.r.squared, sqrt(mean(step.forw$residuals^2)),AIC(step.forw), BIC(step.forw),
summary(step.back)$r.squared, summary(step.back)$adj.r.squared, sqrt(mean(step.back$residuals^2)),AIC(step.back), BIC(step.back)), ncol=5,nrow=4, byrow=TRUE)
colnames(table1)<- c("R^2", "Adjusted R^2", "Square root of MSE","AIC", "BIC")
rownames(table1)<- c("logarea ~.",
"logarea~., (w/o influential points)",
"model with forward selection based on AIC",
"model with backward elimination based on AIC")
table1 <- as.table(table1)
kable(table1)
| | R^2 | Adjusted R^2 | Square root of MSE | AIC | BIC |
|---|---|---|---|---|---|
| logarea ~ . | 0.1254112 | 0.0358017 | 1.425126 | 1011.5275 | 1108.6849 |
| logarea ~ ., (w/o influential points) | 0.1515462 | 0.0597222 | 1.291624 | 914.8673 | 1010.6923 |
| model with forward selection based on AIC | 0.0121343 | 0.0082603 | 1.393706 | 905.9650 | 916.6122 |
| model with backward elimination based on AIC | 0.1071920 | 0.0671067 | 1.324955 | 899.9630 | 946.1009 |
From the summary table above, the forward selection model performs worst. Although the square root of MSE is slightly higher for the backward elimination model than for the full model, its adjusted \(R^2\) is higher and its AIC and BIC are smaller.
Hence, the backward elimination model is considered the best among these linear regression models. However, judging from the \(R^2\) value, it still explains only about 10.7\(\%\) of the variability in the outcome. We then tried a logistic regression model.
We categorized the outcome as binary: burn = 1 if the burned area > 0, and burn = 0 if the burned area = 0.
# Fit Logistic Regression
lg_burn <- glm(burn ~ X+Y+month+day+FFMC+DMC+DC+ISI+temp+RH+wind+rain, family=binomial(link = "logit"),data=new_dat)
summary(lg_burn)
##
## Call:
## glm(formula = burn ~ X + Y + month + day + FFMC + DMC + DC +
## ISI + temp + RH + wind + rain, family = binomial(link = "logit"),
## data = new_dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.59243 -1.16846 0.00038 1.11144 1.99644
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.282e+00 3.485e+00 -2.090 0.0366 *
## X 5.473e-02 4.895e-02 1.118 0.2636
## Y 4.653e-02 9.349e-02 0.498 0.6187
## monthaug -5.892e-01 1.277e+00 -0.461 0.6446
## monthdec 1.683e+01 7.894e+02 0.021 0.9830
## monthfeb 6.951e-01 8.730e-01 0.796 0.4259
## monthjan -1.446e+01 1.417e+03 -0.010 0.9919
## monthjul -5.139e-01 1.129e+00 -0.455 0.6489
## monthjun -5.395e-01 1.042e+00 -0.518 0.6047
## monthmar -3.265e-01 8.120e-01 -0.402 0.6876
## monthmay 1.172e-01 1.630e+00 0.072 0.9427
## monthnov -1.591e+01 2.400e+03 -0.007 0.9947
## monthoct -1.407e+00 1.515e+00 -0.928 0.3532
## monthsep -4.984e-01 1.428e+00 -0.349 0.7270
## daymon 4.548e-02 3.439e-01 0.132 0.8948
## daysat -1.840e-02 3.277e-01 -0.056 0.9552
## daysun -4.778e-02 3.178e-01 -0.150 0.8805
## daythu -5.141e-02 3.596e-01 -0.143 0.8863
## daytue 2.252e-01 3.563e-01 0.632 0.5272
## daywed 3.045e-01 3.713e-01 0.820 0.4122
## FFMC 5.855e-02 3.780e-02 1.549 0.1214
## DMC -1.574e-03 2.830e-03 -0.556 0.5781
## DC 1.338e-03 1.921e-03 0.697 0.4860
## ISI -1.922e-02 2.959e-02 -0.649 0.5160
## temp 4.648e-02 3.448e-02 1.348 0.1777
## RH 8.667e-03 9.863e-03 0.879 0.3796
## wind 7.701e-02 5.926e-02 1.300 0.1937
## rain -1.824e+00 1.212e+00 -1.505 0.1323
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 698.49 on 503 degrees of freedom
## Residual deviance: 659.75 on 476 degrees of freedom
## AIC: 715.75
##
## Number of Fisher Scoring iterations: 15
hoslem.test(new_dat$burn, fitted(lg_burn), g=10) # H0: good fit; a small P-value would indicate a poor fit
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: new_dat$burn, fitted(lg_burn)
## X-squared = 6.6071, df = 8, p-value = 0.5796
gof(lg_burn) #AUC=64.0%
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## chiSq df pVal
## PrI 2 2 2
## drI 4 2 4
## PrG 1 1 1
## drG 3 1 3
## PrCT 1 1 1
## drCT 3 1 3
## val df pVal
## HL chiSq 9 3 8
## mHL F 7 4 2
## OsRo Z 1 5 9
## SstPgeq0.5 Z 5 5 3
## SstPl0.5 Z 3 5 5
## SstBoth chiSq 6 2 7
## SllPgeq0.5 chiSq 4 1 4
## SllPl0.5 chiSq 2 1 6
## SllBoth chiSq 8 2 1
Given the coefficients of the first logistic regression model, none of the variables is statistically significant except for the intercept; temperature and rain have the lowest P-values, closest to the significance level. Because most covariates in our model are continuous, the number of covariate patterns is large, so we used the Hosmer and Lemeshow goodness-of-fit (GOF) test. Given the P-value = 0.5796 > 0.05 from the GOF test, we fail to reject the null hypothesis of a good fit, so this logistic regression model appears to fit adequately. We then plotted the Receiver Operating Characteristic (ROC) curve and examined the area under the curve (AUC); the larger the AUC, the better the fit. Here, AUC = 64.0%, which is acceptable. We then tried to improve the fit by including quadratic terms, based on the significant variables in the linear regression model and the two variables closest to significance in this logistic regression model.
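The AUC reported by `gof` can be cross-checked by hand via the Mann-Whitney identity, AUC = P(a random case scores higher than a random control). A small sketch on toy labels and fitted probabilities (not the model above):

```r
# AUC via the Mann-Whitney identity: the proportion of case/control pairs
# in which the case receives the higher score, counting ties as 1/2.
auc <- function(labels, scores) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}
auc(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))  # 3 of 4 pairs correctly ordered
```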
lg_burn2 <- glm(burn ~X+Y+month+day+FFMC+I(FFMC^2)+DMC+I(DMC^2)+DC+I(DC^2)+ISI+temp+I(temp^2)+RH+wind+rain+I(rain^2), family=binomial(link = "logit"), data=new_dat)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(lg_burn2)
##
## Call:
## glm(formula = burn ~ X + Y + month + day + FFMC + I(FFMC^2) +
## DMC + I(DMC^2) + DC + I(DC^2) + ISI + temp + I(temp^2) +
## RH + wind + rain + I(rain^2), family = binomial(link = "logit"),
## data = new_dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.77127 -1.12232 0.00035 1.10094 1.59486
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.672e+01 2.402e+01 -1.112 0.2660
## X 6.841e-02 4.994e-02 1.370 0.1708
## Y 5.984e-02 9.478e-02 0.631 0.5278
## monthaug 3.866e-01 1.668e+00 0.232 0.8167
## monthdec 1.750e+01 7.858e+02 0.022 0.9822
## monthfeb 6.733e-01 8.864e-01 0.760 0.4475
## monthjan -1.481e+01 1.227e+03 -0.012 0.9904
## monthjul 5.046e-01 1.550e+00 0.326 0.7448
## monthjun 1.345e-01 1.369e+00 0.098 0.9217
## monthmar -1.803e-01 8.330e-01 -0.216 0.8287
## monthmay 3.846e-01 1.666e+00 0.231 0.8175
## monthnov -1.527e+01 2.400e+03 -0.006 0.9949
## monthoct 3.664e-01 1.816e+00 0.202 0.8401
## monthsep 4.587e-01 1.708e+00 0.269 0.7883
## daymon -8.509e-02 3.534e-01 -0.241 0.8097
## daysat 2.603e-02 3.329e-01 0.078 0.9377
## daysun -7.932e-02 3.260e-01 -0.243 0.8078
## daythu -4.474e-02 3.677e-01 -0.122 0.9032
## daytue 2.633e-01 3.702e-01 0.711 0.4768
## daywed 3.617e-01 3.821e-01 0.947 0.3439
## FFMC 5.689e-01 5.669e-01 1.004 0.3155
## I(FFMC^2) -3.156e-03 3.379e-03 -0.934 0.3502
## DMC 2.154e-02 1.222e-02 1.763 0.0780 .
## I(DMC^2) -7.148e-05 3.669e-05 -1.948 0.0514 .
## DC -6.328e-03 5.707e-03 -1.109 0.2675
## I(DC^2) 6.665e-06 5.196e-06 1.283 0.1996
## ISI -1.373e-02 3.212e-02 -0.428 0.6690
## temp -1.260e-01 1.145e-01 -1.100 0.2713
## I(temp^2) 4.519e-03 2.659e-03 1.699 0.0893 .
## RH 9.781e-03 1.041e-02 0.940 0.3475
## wind 8.665e-02 6.047e-02 1.433 0.1518
## rain -9.369e+01 4.306e+03 -0.022 0.9826
## I(rain^2) 7.591e+01 4.517e+03 0.017 0.9866
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 698.49 on 503 degrees of freedom
## Residual deviance: 641.42 on 471 degrees of freedom
## AIC: 707.42
##
## Number of Fisher Scoring iterations: 15
hoslem.test(new_dat$burn, fitted(lg_burn2), g=10) # H0: good fit; large P-value gives no evidence of poor fit
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: new_dat$burn, fitted(lg_burn2)
## X-squared = 3.6565, df = 8, p-value = 0.8867
gof(lg_burn2) #AUC=66.4%
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## chiSq df pVal
## PrI 2 2 2
## drI 4 2 4
## PrG 1 1 1
## drG 3 1 3
## PrCT 1 1 1
## drCT 3 1 3
## val df pVal
## HL chiSq 6 3 7
## mHL F 2 4 6
## OsRo Z 5 5 8
## SstPgeq0.5 Z 4 5 2
## SstPl0.5 Z 3 5 5
## SstBoth chiSq 9 2 1
## SllPgeq0.5 chiSq 7 1 3
## SllPl0.5 chiSq 1 1 8
## SllBoth chiSq 8 2 4
anova(lg_burn, lg_burn2, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: burn ~ X + Y + month + day + FFMC + DMC + DC + ISI + temp + RH +
## wind + rain
## Model 2: burn ~ X + Y + month + day + FFMC + I(FFMC^2) + DMC + I(DMC^2) +
## DC + I(DC^2) + ISI + temp + I(temp^2) + RH + wind + rain +
## I(rain^2)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 476 659.75
## 2 471 641.42 5 18.323 0.002567 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the coefficients, more variables are now close to the significance level (P-value = 0.05), and much closer than in model #1. In this logistic regression model, DMC and temperature appear to be the most influential predictors.
We again applied the Hosmer and Lemeshow GOF test to the new logistic regression model. Since the P-value = 0.8867 > 0.05, we fail to reject the null hypothesis of a good fit; this p-value is larger than the first model's, so the #2 logistic model also appears to fit well. Looking at the ROC curve and AUC, we now have AUC = 66.4%, larger than the first model's 64.0%; since a larger AUC indicates a better fit, the model fits better after adding these variables.
From the table below, logistic model #1 has a smaller BIC while logistic model #2 has a smaller AIC. Thus, besides the Hosmer and Lemeshow GOF test and the ROC curve, we further compared the two models with the Likelihood Ratio Test (LRT), which follows a chi-square distribution because the models are nested. H0: the reduced model is sufficient (here, logistic #1); H1: the full model is preferred (here, logistic #2). Since P-value = 0.002567 < 0.05, we reject the null hypothesis that the reduced model is sufficient. Therefore, logistic model #2 is preferred, and the quadratic terms belong in the model.
table3<- matrix(c(AIC(lg_burn), BIC(lg_burn),
AIC(lg_burn2), BIC(lg_burn2)), ncol=2,nrow=2, byrow=TRUE)
colnames(table3)<- c("AIC", "BIC")
rownames(table3)<- c("logistic model #1",
"logistic model #2")
table3 <- as.table(table3)
kable(table3)
| | AIC | BIC |
|---|---|---|
| logistic model #1 | 715.7466 | 833.9788 |
| logistic model #2 | 707.4232 | 846.7682 |
Now that we have a best linear regression model and a best logistic regression model, which one is better? We compared them via AIC and BIC and concluded that the logistic model fits better, since it has smaller AIC and BIC values (see the table below). The significant factors from the regression models are X, Y, FFMC, DMC, DC, ISI, temp, RH, and wind, which we carry forward into the machine-learning prediction.
tablefinal<- matrix(c(AIC(step.back), BIC(step.back),
AIC(lg_burn2), BIC(lg_burn2)), ncol=2,nrow=2, byrow=TRUE)
colnames(tablefinal)<- c("AIC", "BIC")
rownames(tablefinal)<- c("linear #4: model with backward elimination based on AIC",
"logistic model #2")
tablefinal <- as.table(tablefinal)
kable(tablefinal)
| | AIC | BIC |
|---|---|---|
| linear #4: model with backward elimination based on AIC | 899.9630 | 946.1009 |
| logistic model #2 | 707.4232 | 846.7682 |
Finally, we decided to predict the occurrence of forest fires using machine learning. Based on the results of the regression analysis, we selected the most significant predictors (X, Y, FFMC, DMC, DC, ISI, temp, RH, wind) for the modeling process.
Since we would like to predict a binary outcome, i.e., whether there will be a forest fire or not, we used classification models: logistic regression, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), k-nearest neighbors (kNN), a classification tree, and a random forest. In the modeling process, we split the dataset in half into training and testing sets.
y <- new_dat$burn
set.seed(1)
train_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
train_set <- new_dat[train_index, ]
test_set <- new_dat[-train_index, ]
#Logistic Machine Learning
logistic_fit <- glm(burn~X+Y+month+day+FFMC+I(FFMC^2)+DMC+I(DMC^2)+DC+I(DC^2)+ISI+temp+I(temp^2)+RH+wind+rain+I(rain^2), train_set, family=binomial(link = "logit"))
# Note: without type="response", predict() returns the linear predictor (log-odds),
# so the 0.5 cutoff below corresponds to a probability of about 0.62, not 0.5.
p_hat_logit <- predict(logistic_fit, newdata=test_set)
y_hat_logit <- ifelse(p_hat_logit>0.5, 1,0)
confusionMatrix(data=as.factor(y_hat_logit), reference = as.factor(test_set$burn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 72 60
## 1 56 64
##
## Accuracy : 0.5397
## 95% CI : (0.476, 0.6024)
## No Information Rate : 0.5079
## P-Value [Acc > NIR] : 0.1723
##
## Kappa : 0.0787
##
## Mcnemar's Test P-Value : 0.7806
##
## Sensitivity : 0.5625
## Specificity : 0.5161
## Pos Pred Value : 0.5455
## Neg Pred Value : 0.5333
## Prevalence : 0.5079
## Detection Rate : 0.2857
## Detection Prevalence : 0.5238
## Balanced Accuracy : 0.5393
##
## 'Positive' Class : 0
##
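Because `predict.glm` returns the linear predictor by default, the 0.5 cutoff used above is on the log-odds scale rather than the probability scale. The two scales relate via the logistic function; a quick sketch of the conversion:

```r
# Converting between the log-odds and probability scales
logistic <- function(eta) 1 / (1 + exp(-eta))
logistic(0.5)  # a 0.5 cutoff on the log-odds scale is roughly p = 0.62
logistic(0)    # a p = 0.5 cutoff corresponds to log-odds 0
```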
set.seed(1)
qda_fit <- qda(burn ~ X+Y+FFMC+DMC+DC+ISI+temp+RH+wind, data = train_set)
qda_preds <- predict(qda_fit, test_set)
confusionMatrix(data=as.factor(qda_preds$class), reference = as.factor(test_set$burn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 29 27
## 1 99 97
##
## Accuracy : 0.5
## 95% CI : (0.4366, 0.5634)
## No Information Rate : 0.5079
## P-Value [Acc > NIR] : 0.6237
##
## Kappa : 0.0087
##
## Mcnemar's Test P-Value : 2.529e-10
##
## Sensitivity : 0.2266
## Specificity : 0.7823
## Pos Pred Value : 0.5179
## Neg Pred Value : 0.4949
## Prevalence : 0.5079
## Detection Rate : 0.1151
## Detection Prevalence : 0.2222
## Balanced Accuracy : 0.5044
##
## 'Positive' Class : 0
##
lda_fit <- lda(burn ~ X+Y+FFMC+DMC+DC+ISI+temp+RH+wind, data = train_set)
lda_preds <- predict(lda_fit, test_set)
confusionMatrix(data = as.factor(lda_preds$class), reference = as.factor(test_set$burn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 38 36
## 1 90 88
##
## Accuracy : 0.5
## 95% CI : (0.4366, 0.5634)
## No Information Rate : 0.5079
## P-Value [Acc > NIR] : 0.6237
##
## Kappa : 0.0065
##
## Mcnemar's Test P-Value : 2.34e-06
##
## Sensitivity : 0.2969
## Specificity : 0.7097
## Pos Pred Value : 0.5135
## Neg Pred Value : 0.4944
## Prevalence : 0.5079
## Detection Rate : 0.1508
## Detection Prevalence : 0.2937
## Balanced Accuracy : 0.5033
##
## 'Positive' Class : 0
##
p1 <- prediction(p_hat_logit, test_set$burn) %>% performance(measure = "tpr", x.measure="tnr")
p2 <- prediction(qda_preds$posterior[,2], test_set$burn) %>% performance(measure = "tpr", x.measure="tnr")
p3 <- prediction(lda_preds$posterior[,2], test_set$burn) %>% performance(measure = "tpr", x.measure="tnr")
p2 %>% plot(main="QDA ROC",xlim=c(1.5,-0.5))
p3 %>% plot(main="LDA ROC",xlim=c(1.5,-0.5))
# Logistic regression AUC
prediction(p_hat_logit, test_set$burn) %>%
performance(measure = "auc") %>%
.@y.values #0.545
## [[1]]
## [1] 0.5454259
# QDA AUC
prediction(qda_preds$posterior[,2], test_set$burn) %>%
performance(measure = "auc") %>%
.@y.values #0.510
## [[1]]
## [1] 0.5101436
# LDA AUC
prediction(lda_preds$posterior[,2], test_set$burn) %>%
performance(measure = "auc") %>%
.@y.values #0.501
## [[1]]
## [1] 0.5010711
train_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
train_set <- new_dat[train_index, ]
test_set <- new_dat[-train_index, ]
train_set$burn <- as.factor(train_set$burn)
test_set$burn <- as.factor(test_set$burn)
# Knn
knn_fit <- knn3(burn~X+Y+FFMC+DMC+DC+ISI+temp+RH+wind, data = train_set,k=5)
knn_hat <- predict(knn_fit, test_set, type = "class")
tab <- table(pred=knn_hat, truth=test_set$burn)
confusionMatrix(tab)
## Confusion Matrix and Statistics
##
## truth
## pred 0 1
## 0 71 45
## 1 57 79
##
## Accuracy : 0.5952
## 95% CI : (0.5318, 0.6564)
## No Information Rate : 0.5079
## P-Value [Acc > NIR] : 0.003294
##
## Kappa : 0.1915
##
## Mcnemar's Test P-Value : 0.276082
##
## Sensitivity : 0.5547
## Specificity : 0.6371
## Pos Pred Value : 0.6121
## Neg Pred Value : 0.5809
## Prevalence : 0.5079
## Detection Rate : 0.2817
## Detection Prevalence : 0.4603
## Balanced Accuracy : 0.5959
##
## 'Positive' Class : 0
##
knn_prob <- predict(knn_fit, test_set, type="prob")[,2]
plot(performance(prediction(knn_prob, test_set$burn), "tpr", "tnr"),main="KNN ROC", xlim=c(1,0))
prediction(knn_prob, test_set$burn) %>%
performance(measure = "auc") %>%
.@y.values #0.607
## [[1]]
## [1] 0.6066028
plot(p1, col="red",xlim=c(1.5,-0.5), main="Logistic(red), LDA(green), QDA(blue), kNN(pink) ROC Plot")
plot(p2, add=TRUE, col="blue")
plot(p3, add=TRUE, col="green")
plot(performance(prediction(knn_prob, test_set$burn), "tpr", "tnr"),add=TRUE,col="hot pink")
Machine Learning Summary
Among these four models, kNN with k = 5 has the highest accuracy, the most balanced sensitivity and specificity, and the highest AUC. The logistic model scores slightly lower on these metrics but is still fairly balanced. Therefore, we chose the kNN and logistic models over the other two.
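For a side-by-side view, the headline test-set metrics reported above can be collected into one small table (values transcribed from the confusion-matrix and AUC output, not recomputed):

```r
# Test-set metrics transcribed from the output above
ml_summary <- data.frame(
  model       = c("logistic", "QDA", "LDA", "kNN (k=5)"),
  accuracy    = c(0.5397, 0.5000, 0.5000, 0.5952),
  sensitivity = c(0.5625, 0.2266, 0.2969, 0.5547),
  specificity = c(0.5161, 0.7823, 0.7097, 0.6371),
  auc         = c(0.545, 0.510, 0.501, 0.607)
)
ml_summary[which.max(ml_summary$accuracy), "model"]
```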
## Classification tree, Random Forest
set.seed(1)
# Classification Tree
train_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
train_set <- new_dat[train_index, ]
test_set <- new_dat[-train_index, ]
train_set$burn <- as.factor(train_set$burn)
test_set$burn <- as.factor(test_set$burn)
tree_fit <- tree(burn ~X+Y+FFMC+DMC+DC+ISI+temp+RH+wind, data=train_set)
summary(tree_fit)
##
## Classification tree:
## tree(formula = burn ~ X + Y + FFMC + DMC + DC + ISI + temp +
## RH + wind, data = train_set)
## Variables actually used in tree construction:
## [1] "DMC" "X" "RH" "temp" "wind"
## Number of terminal nodes: 9
## Residual mean deviance: 1.147 = 278.8 / 243
## Misclassification error rate: 0.3095 = 78 / 252
pred <- predict(tree_fit, test_set,type = "class")
confusionMatrix(as.factor(pred), reference = test_set$burn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 20 17
## 1 108 107
##
## Accuracy : 0.504
## 95% CI : (0.4405, 0.5673)
## No Information Rate : 0.5079
## P-Value [Acc > NIR] : 0.5751
##
## Kappa : 0.0189
##
## Mcnemar's Test P-Value : 8.29e-16
##
## Sensitivity : 0.15625
## Specificity : 0.86290
## Pos Pred Value : 0.54054
## Neg Pred Value : 0.49767
## Prevalence : 0.50794
## Detection Rate : 0.07937
## Detection Prevalence : 0.14683
## Balanced Accuracy : 0.50958
##
## 'Positive' Class : 0
##
plot(tree_fit, type = "uniform", main="Classification tree")
text(tree_fit, cex = 1)
cv_tree <- cv.tree(tree_fit)
plot(cv_tree)
best.size <- cv_tree$size[which(cv_tree$dev==min(cv_tree$dev))] # which size is better?
best.size
## [1] 1
prune_fit <- prune.tree(tree_fit, best = 4)
plot(prune_fit, type="uniform", main="Classification tree with best=4")
text(prune_fit)
# Random Forest
rf <- randomForest(burn~X+Y+FFMC+DMC+DC+ISI+temp+RH+wind, data = train_set)
rf
##
## Call:
## randomForest(formula = burn ~ X + Y + FFMC + DMC + DC + ISI + temp + RH + wind, data = train_set)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 39.29%
## Confusion matrix:
## 0 1 class.error
## 0 58 61 0.5126050
## 1 38 95 0.2857143
preds_rf <- predict(rf, test_set)
confusionMatrix(as.factor(preds_rf), reference = test_set$burn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 56 48
## 1 72 76
##
## Accuracy : 0.5238
## 95% CI : (0.4602, 0.5869)
## No Information Rate : 0.5079
## P-Value [Acc > NIR] : 0.32974
##
## Kappa : 0.0503
##
## Mcnemar's Test P-Value : 0.03576
##
## Sensitivity : 0.4375
## Specificity : 0.6129
## Pos Pred Value : 0.5385
## Neg Pred Value : 0.5135
## Prevalence : 0.5079
## Detection Rate : 0.2222
## Detection Prevalence : 0.4127
## Balanced Accuracy : 0.5252
##
## 'Positive' Class : 0
##
varImpPlot(rf,main = "Random Forest Important Variable Plot")
Using the cv.tree function, we chose 4 as the number of terminal nodes. Although the sizes with the lowest deviance in the plot are 1 and 2, it is more reasonable to choose 4, since the difference in deviance is very slight. For this model, the accuracy is 0.504, while the sensitivity (0.1563) and specificity (0.8629) are extremely unbalanced. We would not consider it a good model for predicting forest fires.
For the random forest model, the accuracy is 0.5238, slightly lower than the logistic and kNN models. Its sensitivity (0.4375) and specificity (0.6129) are not quite balanced, but they are much more balanced than those of the LDA and QDA models. Thus, this model is also considered suitable for classifying the forest fire outcome.
From the Shiny app and plots, forest fires are most common during August and September, and most frequently occur at temperatures around 20°C and relative humidity around 40%. The distribution of wind speed is more spread out. Since the frequency and burned area of forest fires do not vary significantly between weekdays and weekends, human activity does not appear to have a great impact on forest fires.
Although the number of occurrences of forest fires is not linearly associated with the weather observations, large-scale forest fires are more likely to occur under high temperature and low humidity.
For regression analysis, the backward-elimination (AIC-based) model was the best among the linear regression models, and the second logistic model was the best of the two logistic models. Comparing these two by AIC and BIC (see the table in the previous section), we concluded that the second logistic regression model fits best.
Linear vs Logistic
By comparing the P-values, we ranked the most significant variables as follows: X, Y, FFMC, DMC, DC, ISI, temp, RH, wind. Using these factors, we performed machine learning to predict the binary forest fire outcome.
For the machine learning models, we considered the kNN, logistic, and random forest models the best at predicting forest fires, because they have the highest accuracy and the most balanced specificity and sensitivity.
Moreover, in the random forest variable importance plot, relative humidity (RH), temperature, Duff Moisture Code (DMC), and Drought Code (DC) are the top four, which accords with the results from the visualization. In the future, the fire department could monitor these measurements closely and prepare accordingly.